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This paper presents a novel way to approximate a distribution governing a system of coupled 
particles with a product of independent distributions. The approach is an extension of mean field 
theory that allows the independent distributions to live in a different space from the system, and 
thereby capture statistical dependencies in that system. It also allows different Ifamiltonians for each 
independent distribution, to facilitate Monte Carlo estimation of those distributions. The approach 
leads to a novel energy-minimization algorithm in which each coordinate Monte Carlo estimates an 
associated spectrum, and then independently sets its state by sampling a Boltzmann distribution 
across that spectrum. It can also be used for high-dimensional numerical integration, (constrained) 
combinatorial optimization, and adaptive distributed control. This approach also provides a simple, 
physics-based derivation of the powerful approximate energy- minimization algorithms semi- formally 
derived in 8, 9, 11]. In addition it suggests many improvements to those algorithms, and motivates 
a new (bounded rationality) game theory equilibrium concept. 

PACS numbers: 89.20.Ff,89.75.-k , 89.75.Fb 
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I. INTRODUCTION 

Mean Field Approximations (MFA's) have many ap- 
plications outside of physics, combinatorial optimization 
by mean-field annealing being perhaps the most promi- 
nent of them 1^ 1^. One problem with MFA's is that 
because they treat all particles as independent, they can 
be a poor approximation of the desired distribution. An- 
other is that especially in large systems with complex 
Hamiltonians, they can be expensive to evaluate. 

We extend MFA's by introducing coordinate transfor- 
mations that allow us to convert a coupled distribution 
into a decoupled one. This allows us to improve the ac- 
curacy of the MFA. It also allows us to employ Monte 
Carlo mean-field annealing for optimal adaptive control. 
We also introduce separate Hamiltonians for each parti- 
cle. This allows us to exploit recent results in collectives 
theory Q to speed up MFA evaluations. This extended 
MFA justifies heuristics recently found to beat simulated 
annealing in adaptive control problems by orders of mag- 
nitude H,0,O|. It also provides novel ways to do (con- 
strained) combinatorial optimization, a novel (bounded 
rational) game theory equilibrium concept P, 0] , and a 
way to calculate such equilibria. 



II. SELF-CONSISTENT DISTRIBUTIONS 

Let ^ be a space with elements z and cardinality |C|- 
A semi- coordinate system of C, S, is a space ^ having 
elements x = {xi € S,i,...,xm G £,m) together with a 
surjective map from ^ to C, written as C(.). C(.) need not 
be invertible; the (semi-)coordinates of a point x ^ may 
not be unique. No a priori restriction is made on whether 
^ and C are countable, uncountable, time-extended, etc. 

The space of possible probability distributions (or 
density functions, as the case may be) over ^ is V'' . 



Any p e T't induces a distribution over C, p{z) = 
/ dx p{x)S{z — ({x)). (We imphcitly assume integral 
measures and delta functions that match ^.) Expressions 
like "p(0 = "K = P'r and = where 

p,p' e T'^ mean Vx € p{x) = p'ix), p{xi) = p'{xi), 
and p{xj) — p'{xj) Vj 7^ i, respectively. V is the set of all 
product distributions over ^, i.e., the submanifold of all 
p obeying p{£_) = YiiPi^i)- We will use the obvious 
parameterization of elements of V as the vectors of their 
marginal distributions, written g — (qi(Ci), ■ • • ,9m(Cm))- 
Note that changing ({.) with fixed will change the man- 
ifold V in general but won't affect Q, the space of all q. 

We are interested in self-consistent q that obey qi{^i) ~ 
Ai{£_i,q^.^) V« for some functions {Ai}. By Brouwer's 
theorem, for any smooth {Ai}, the map q — > {Ai{£,i, q^.^ )} 
(which we call parallel Brouwer updating) has at least 
one fixed point. Here we consider such Ai that set qi 
by minimizing a functional of {qi,q^.^). As an example, 
that functional might be the same for all i, for each one 
measuring the error of {qi,q^.^) as an approximation of 
some p S V'^ . In this case the fixed point(s) of the {Ai} 
are locally optimal approximations in V of that p. 

Write the cross entropy from p to p' as S(j) \\ p') = 
— J dx p(a;)ln[^^^], where as usual is a prior probabil- 
ity over ^ that ensures the argument of the logarithm 
is unitless. So the entropy of p is just S{p \\ p) = 
S{p), and the KuUback-Leibler distance from p to p' 
is KL{p II /) = S{p II p') - S{p) [? ]. Next de- 
fine private Hamiltonians hij where ^(x) — Ci^') ^ 
hij{x) = hij{x') \/i,j,x and x', so expected values of 
a Hamiltonian have the same value in ( and ^. We 
write hi{^) = X^jl'i where the temperatures 
{/3ij } are usually non- negative and write the associated 
free energies as Fh.{p) = J dx p{x)hi{x) — S{p). So 
argmin^'gg.F/j; (g^, g^.j ) is the q[ maximizing the entropy 
of 'l'i{^i)Y\k^i1k{£,k), subject to expected values of i's 
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private Hamiltonians implicit in i's temperatures. The 
associated fixed points obey 



where Ni{[U]i^p) is the partition fimction and the bracket 
notation indicates expectation conditioned on the value 
of the subscripted coordinate: for any U and p ^ 
[U]t,piS,i) = /dx(,)C/(C^,a;(,))p(x(i) | ^0- 

We can choose {Ai} based on objective functions other 
than the free energy; the important thing is to incorpo- 
rate a concave function like entropy, to avoid the expen- 
sive process of checking the borders of Q for fixed points. 
For example, all the local minima of Fh^ (q) over Q are 
interior to Q. Moreover, for any fixed q^.^ , there is only 
one local minimum over Q^. 

The following six examples all have uniform fJ-- S, — C 
(as in conventional MFA's) in all but the fifth example: 

Example 1: Let rrii = 1 Then at each coordinate 
i obeys its own canonical ensemble and is independent of 
the values of the other coordinates. It is coupled to its 
own heat bath, with its own effective Hamiltonian set by 
the distributions over the other coordinates, (6)- 
In contrast, if \/i,mi = 2 and hi^2 — each qf(S,i) 

is a Gaussian over the values [hi\^i{£,i) rather than a 
Boltzmann distribution over them. In this case Pi^i and 
Pi,2 specify the mean and variance of hi^i. 

Example 2: We have a team game when and j, 

nii = mil , and hi = hii . In a team game we have a 
single shared objective function trading off entropy and 
expected values of the Hamiltonians. For m — 1 we de- 
fine the world Hamiltonian H as the single shared private 
Hamiltonian. For this the minimizer of the free energy 
over all 'P^ is the joint canonical ensemble distribution, 
pPH Jn contrast, q^^ , the minimizer over V, is an 
MFA for p'^^. Since C is arbitrary, by gradually increas- 
ing /3 one can converge on a g that is a delta function 
about the z that minimizes H{z). Since C is arbitrary, 
this optimization algorithm can be applied for any under- 
lying space one wishes to search. In particular, parallel 
Brouwer updating with gradually increasing f3 is equiva- 
lent to mean- field annealing 6]. 

In general q^^{£,i) ^ p'^^i^i), the marginal of p'^^id). 
However the q^^ with the lowest free energy is the q that 
best approximates p^^ , as measured by KL{q^^ ^p'^^) 
To have the fixed points equal the marginals we must 
instead have the {Ai} minimize KL{p^^ , q^^).) 

Example 3: One way to enforce a constraint f{z) = 
is via a generalization of penalty functions. Say we have 
an rn = 1 team game with world Hamiltonian H . Choose 
the objective function / dxp{x)[l3H{x) -\- /3/[/(C(a;))]^] — 
S(j)). Then for (3f — > oo our constraint will be en- 
forced, if possible. Often that solution is on Q's bor- 
der though, which slows the search — finite (if moves 
the solution(s) to the interior of Q, by weakening the 
constraint. Alternatively, choose the objective function 
/ dxp{x){l3H{x) + A[/(C(x))]2) -f - S{p), where 7 > 



is a constant. Minimizing over both p £ Q and A G M 
forces /(z) = if 7 = 0, while nonzero 7^ weakens the 
constraint but forces the fixed point q{s) inside Q. 

Example 4- Consider a team game with m — m' + 1 and 
hj{x) = hj{xj) e N+ Vj < m'. Here is an MFA for 
the grand canonical ensemble of a system with m' particle 
types and Hamiltonian hm'+i' Xi encodes the states of 
all type i particles, and hi counts their number 4]. 

Example 5: Say is bijective but ^ ^ C- Then 
is akin to a rotation, in that each couples multiple 
components of C. More generally, |^| ^ |C| means ^ is an 
M-dimensional representation of an M' < M-dimension 
system. This allows arbitrary distributions over C to be 
expressed as a product distribution. 
As an illustration, say C is two-dimensional with elements 
(zi,Z2)- Take ^1 = ^1, and have an additional ^ coordi- 
nate for each separate value of zi, which we write as 
^zj. So there are 1 -|- |Ci| coordinates altogether. Have 
each S^zx contain |^2| elements, i.e., separate subsets 
of and label those subsets by the values z^ G C,2- 
Have C,{xi,. . . ,xi+\c.^\) = {xi,XxJ, i.e., zi = xi and 
Z2 — Xz-i- So we can write p{zi) — X]22 "^2) = 
T,z2T,x:x,=zux.,=z2li'') = (lx^izl). Therefore p(z2 | 
zi) = 9(61 =^2 16= zi)- Sop{z2 I Zi) = qzi{z2)- 
This allows us to do optimal distributed adaptive control 
in C with an MFA over ^. Let C2 be the state of the plant 
one wants to control, i^i the control variable you can set, 
and H(C2) the objective function one wants to minimize. 
The goal in optimal control is to find &vgmiTiz^E{'H \ zi). 
Have the qi>2 fixed to P{C,2 I Ci); the distribution relat- 
ing the plant variable and the control variables. Take 
■nil = 1, /ii,i(0 = ^(C(0); ^-iid redefine Fh^ to in- 
volve the entropy of just ^1 rather than all of ^. Anneal 
/? — > c», so that gi(^i) becomes a single delta function. 
Write the resultant q^^ {£,) as 5^i,xxY\i>2lii^i)- This 
distribution minimizes X^x' ni>2 ~ 

^iA^')lxiWxi) = E2^(^)P(^2 I zi = xi). So 

zi = xi solves our optimal control problem. 

Example 6: Formally, when rrii — l^i and ^ = Ci each 
coordinate is a player in a non-cooperative game P, 0] , 
with X the players' joint move and {— their payoff 
functions. As the (3i 00, q^l^^'^^.^^ becomes a (normal 
form) mixed strategy Nash equilibria. 
Now consider the general bounded rational (i.e., non- 
Nash equilibrium) scenario. Say the players are told the 
entropy of the joint system, and each finds its best possi- 
ble mixed strategy subject to the other players' strategies 
and entropy value. Then the system is at a q^ . 
As an alternative interpretation of q^ , say we have a ra- 
tionality functional R{U,qi) that measures how peaked 
any qi is about argmax2;;J7(a;i) for any U : S^i ^ M.. We 
require that R{U,qi) = /3 if qi{d) oc A'(^i)e~^^(^'\ and 
that the qi satisfying R(U, qi) = (3 that has maximal 
entropy is ^{£,i)e~^^^^'-'i /Ni{U). Now say we are told 
the the rationalities of the M players for their 

associated effective Hamiltonians Then the 

information-theoretic optimal estimate for the associated 



3 



q is the minimizer over q and the {A^} of the free energy 
T.^H^i{R{[Kl\i,q^<li))-fiiK)\-S{q) for any monotoni- 
cally increasing functions {/i} 0. At any local minimum 
of this free energy q = q^ with f3i = (3* Mi. 
This use of rationality functionals expresses the bounded- 
rational solution to any non-team game as the minimizer 
of a single objective function whose local minima are 
all interior to Q. As an illustration, choose /i(/3) to 
be the ideal expected Hamiltonian dp\n{Ni{(3[hi^i\^ ^)), 

and R{U,qi) = argmin^ KL{qi \\ ■^-^^)- For such a 
choice is the actual expected Hamilto- 

nian, / dxi qiixi)[his]i ,jixi). 



III. FINDING FIXED POINTS 

When we have an to = 1 team game we can use 
variants of gradient descent to find minima of the (sin- 
gle) free energy. Another approach for such scenarios is 
parallel Brouwer updating. More generally, define the 
free energy gap at q for coordinate i as ln[A^i([ft,i]j ^g)] + 

J dxiq,{xi)[hi]^^^g{xi) + J dxiq,{xi)\n3i^. This is how 
much is reduced if only qi undergoes the Brouwer 
update. Define serial Brouwer updating as only updat- 
ing one qi at a time. In an to = 1 team game, any such 
update must reduce Ff3H (p) , in contrast to the case with 
parallel Brouwer updating. In greedy serial Brouwer up- 
dating, instead of cycling through all i, at each iteration 
we update only the coordinate with the largest gap; this 
maximizes the free energy drop in that update. 

A practical difficulty with these schemes for finding 
fixed points is that evaluating [hi]^ can be very difficult 
in large systems. An alternative is to use Monte Carlo 
simple sampling to get estimates of the effective Hamil- 
tonians, and use those to update p. In this scheme, given 
a q at iteration t, each Xi is separately set by randomly 
sampling qi{t), thereby generating x(t). Next the pair 
{xi(t), hi{x{t))) is combined with previous pairs and the 
update rule to set qi{t + l), and then the process repeats. 

To simplify the analysis, consider simple gradient de- 
scent of the free energy for the m = 1 team game. Have p 
be constant through each successive block of r timesteps, 
updating only when we go from some block m to block 
TO -|- 1, with the update based on observations during 
block TO. Say we have a team game and are at a block- 
transition, t = TTO-f-l, and let ni{t) G Vi^t) be all informa- 
tion the algorithm controlling qi has at that time, includ- 
ing the associated Monte Carlo samples. So we have a 
posterior conditional distribution of possible gradient de- 
scent directions P(-F^_^(q(i)) | ni{t)), where is 
the components involving coordinate i of the projection 
onto V of the free energy gradient VFH^p{q{t)). 

In gradient descent updating, that distribution should 
set the vector to add to qi{t) to get qi{t + 1). More 
precisely, since agent i knows qi{t), presuming quadratic 
loss refiects quality of the update, the Bayes-optimal es- 



timate of the gradient is the posterior expected gradi- 
ent, Jd[q^^^{t)] I n,) x Fljj^{q{t)). Expanding, 

the Xi component of Flj_p{q) is Ui{xi) - Y,x, "^(^^O/I^Ij 
where Ui(xi) = (3[H]^ ^g{xi) + \D[q{xi)]. Rather than 

evaluate the integral of [-ff]i^g(Ci)' '^^^ ^ maxi- 
mum likelihood estimator, i.e., replace that integral with 
{H)i^ni{^i), the average of the observed H values over the 
instances (of the just-completed block) when — Xi. 

Unfortunately, often in very large systems the con- 
vergence of {H)i^ni{£,i) is very slow, since the distri- 
bution sampled by the Monte Carlo to produce rii is 
very broad. To address this, posit that the differences 

G ^i} are unchanged 
when one replaces H with some hi. This means that 
qi{t) is unchanged by that replacement. The set of all 
hi guaranteed to have this character, regardless of the 
form of q{xi,(,^.^), is the set of all difference Hamiltoni- 
ans, hi{(^) — i?(^) — Z)i(^j.j) for some function Di. Now 
across block to we are sampling P{hi{^)) to generate n^, 
and then evaluating {hi)i^.{xi). The associated vari- 
ances of values (one for each of the x^), Wai{{hi)i^^.{xi)), 
govern the accuracy of the estimate of the free energy 
gradient. For well-chosen Di these variances may be far 
smaller than when hi = H . In particular, if the number 
of coordinates coupled to f i through H does not grow as 
the system does, often such difference Hamiltonian vari- 
ances will not grow much with system size, whereas the 
variances Var(7?i^y. (x^)) will grow greatly. Furthermore, 
very often such a difference Hamiltonian is far easier to 
evaluate than is H, due to cancellation in subtracting Di. 

More precisely, for practical reasons we want i's up- 
date algorithm to be robust against misperception of 
(/(.J . So for quadratic loss, assuming no t^^ = 0, consider 

the hi minimizing / dq^^, -?(?<,))[/ dnfP{n^ \ nf,q^.^,hi) 

{FhiqiO) - K,ili)V]^ where Piq,^^) reflects any prior 
information we might have concerning g^.j (e.g., that it 

is likely that the associated F'^{q) is close to that esti- 
mated for the previous block of r steps). Here F^. (qi) is 
our estimator for FJ^f^ (q) , and n" is the Hamiltonian val- 
ues contained in rii, the associated ^i values, n^' , being 
independent of hi and q^.^ and therefore fixed. 

The inner integral is a sum, of the (square of the) 
bias F^{qi) - F^{q) with the variance, / dnfP{n^ \ 
nf,q^^^,h,) {Fi^^iq,) - F^{qi)}\ where P{qi) = 

^dn'^P{nf I nf ,q^.^,hi)F^^{qi). By only considering 
difference utilities we guarantee that the bias equals 0. 
Now expand our variance of vectors as a sum of vari- 
ances of scalars, one for each value Xi. Since Ui is IID 
generated that sum is ^^|'^^^ Var((/ii)i^i,. (xi))/ra;. . 

The difference Hamiltonian minimizing this is H{s) — 
-1 

r / 

^ -1 H{x'i, x^.^ ) j9j]. (If all the terms in this sum 

cannot be stored because is too large, then nf must 
be averaged over as well, which can be approximated by 
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replacing the t^- with q{xi)T.) Being independent of q^.^ , 
this Hamiltonian minimizes our q^.^ integral, regardless of 
P{q^.^). For the same reason it is optimal if if the integral 
is replaced by a worst-case bound over q^.^ . 

An extensive series of experiments have been con- 
ducted with this optimal Hamiltonian under the approx- 
imation of uniform q{xi), and under the approximation 
that — for one and only one Xi value. These 
experiments compared this Hamiltonian with the team 
game, for many different H, under a variant of the par- 
allel Brouwer update rule |a that crudely corrected 
for non-stationarity. In that work these algorithms were 
semi-formally justified as ways to minimize H, with no 
appreciation for free energy, self-consistency, or the like. 
Indeed, due to the nonlinearity of the Brouwer update 
rule's dependence on [hil^^g, bad distortions may arise 

with the update used in that previous work, q{xi) 
^-0E{h,\n,.i,=x,)^Y^^^^-(}Eih,\n,.(,=x,)_ ^^g^^ there are 

no known assurances that small Var(/ii | nj,^^ = Xi) 
result in more accurate updating with parallel Brouwer 
updating, even when that updating is done correctly. De- 
spite these shortcomings, in the experiments in the 
approximations to the optimal Hamiltonian minimized H 
up to orders of magnitude faster than team games, with 
the improvement growing with problem size. 

IV. AVOIDING LOCAL MINIMA 

Say we have an m = 1 team game, are ultimately in- 
terested in annealing /3, and are currently at a local min- 
imum q^ G Q oi the shared free energy. Then to break 
out of that minimum we can simply raise (3 and restart 
the updating, since we want to raise (3 anyway, and in 
general doing so will change the Fh so that the Lagrange 
gaps become nonzero. 

A way to break free without changing the free energies 
is to switch to a coordinate system for ^, and thereby 
change V. As an example, can join two components 
of ^ into an aggregate coordinate. Since we can now have 
statistical dependencies between those two components. 



the space product distributions map to a superset of 
v. In general the local minima of that superset do not 
coincide with local minima of V. 

Less trivially, say — and ^(.) is the identity 
map for all but a few components, indicated as indices 
1 — > n. Have ^(.) be a bijection, so that for any fixed 
x'^j^.i^M ~ ^n+i^Mj the effect of the coordinate trans- 
formation is merely to "shuffle" the associated mapping 
taking coordinates 1 n to ^. Say we have a m = 1 

team game, and set <7„_|_]^^^j — q^+i^M- This means we 
can estimate the expectations of (3H conditioned on pos- 
sible from the Monte Carlo samples conditioned 
on So for any ^(.) we can estimate E{H) as 
J dxl^^p^\xl^jE{H I ?(a;?_._„)). Now entropy is the 
sum of the entropy of coordinates n + I M plus that 
of coordinates 1 ^ n. Accordingly, for any choice of ^(.) 
and we can approximate as (our associated 

estimate of) E{H) minus the entropy of Pl^^, minus a 
constant unaffected by choice of ^(.). 

So for finite and small enough |^i_>„|, we can use our 
estimates E{H \ ^(a;f_,„)) to search for the "shuffling" 

£,{.) and distribution (?^^„ that minimizes (penaliz- 
ing by the bias^ plus variance expression if we intend 
to do more Monte Carlo). The search can involve a se- 
ries of free energy descents over ^x^„ for each possible 

£,{.), or use cruder heuristics, like having qf^^ — ql^n^ 
and only varying ^(.). Not only should this coordinate 
transformation lower the free energy, it should also re- 
sult in a new surface through 7-"^ that is no longer at a 
local minimum. More generally, for arbitrary we can 
bound F^2 > F^i > F^2 — maxj.i [ln(^^2 '5{i(a;2),a;i)] (and 
similarly for uncountable ^^). So if after switching to 
we can then reduce F^2 to less than the value F^i had 
when we made the switch, then we know we have also 
reduced F^i to below that pre-switch value. An upper 
bound on the how large that drop in F^2 needs to be is 
max^i [MJ2x^^e(x^),x^)]- 

I would like to thank Chiu Fan Lee, Bill Macready, and 
Stefan Bieniawski for helpful discussion. 



[1] R.J. Aumann and S. Hart. Handbook of Game Theory 
with Economic Applications. North-Holland Press, 1992. 

[2] D. Challet and N. F. Johnson. Optimal combinations of 
imperfect objects. Phys. Rev. Let, 89:028701, 2002. 

[3] G. Korniss et al. Science, 299:677, 2003. 

[4] E. T. Jaynes. Information theory and statistical mechan- 
ics. Physical Review, 106:620, 1957. 

[5] H. Nishimori. Statistical physics of spin glasses and in- 
formation processing. Oxford University Press, 2001. 

[6] C. Peterson and J. R. Anderson. A mean-field theory 
learning algorithm for neural networks. Complex Sys- 
tems, 1:995, 1987. 

[7] E. W. Piotrowski and J. Sladkowski. An invitation to 



quantum game theory, /quant-ph/0211191, 2002. 
[8] D. Wolpert and K. Tumer. Beyond mechanism design. 
In H. Gao et al., editor, International Congress of Math- 
ematicians 2002 Proceedings. Qingdao Publishing, 2002. 
[9] D. H. Wolpert. Theory of collective intelligence. In 
K. Tumer and D. H. Wolpert, editors. Collectives and the 
Design of Complex Systems, New York, 2003. Springer. 

[10] D. H. Wolpert, K. Tumer, and E. Bandari. Improving 
search by using intelligent coordinates. 2003. submitted. 

[11] D. H. Wolpert, K. Wheeler, and K. Tumer. Collective 
intelligence for control of distributed dynamical systems. 
Europhysics Letters, 49(6), March 2000. 



